Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown
نویسنده
چکیده
The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from roc analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown costs of error. We also compare for one domain these results to those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same roc curve.
منابع مشابه
On Mining Fuzzy Classification Rules for Imbalanced Data
Fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification purposes. One of the major issues when applying it on imbalanced data sets is its biased to the majority class, such that, it performs poorly in respect to the minority class. However many cases the minority classes are more important than the majority ones. In this paper, we have extended ...
متن کاملOn Mining Fuzzy Classification Rules for Imbalanced Data
Fuzzy rule-based classification system (FRBCS) is a popular machine learning technique for classification purposes. One of the major issues when applying it on imbalanced data sets is its biased to the majority class, such that, it performs poorly in respect to the minority class. However many cases the minority classes are more important than the majority ones. In this paper, we have extended ...
متن کاملA Novel Approach for Handling Imbalanced Data in Medical Diagnosis using Undersampling Technique
In many data mining applications the imbalanced learning problem is becoming ubiquitous nowadays. When the data sets have an unequal distribution of samples among classes, then these data sets are known as imbalanced data sets. When such highly imbalanced data sets are given to any classifier, then classifier may misclassify the rare samples from the minority class. To deal with such type of im...
متن کاملConversion of Imbalanced Data Into A Stream Using SMOTE Algorithm
Machine learning approach has got major importance when distribution of data is unknown. Classification of data from the data set causes some problem when distribution of data is unknown. Characterization of raw data relates to whether the data can take on only discrete values or whether the data is continuous. In real world application data drawn from non-stationary distribution, causes the pr...
متن کاملData Mining for Imbalanced Datasets: An Overview
A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs...
متن کامل